Summary

One of the most important techniques with the highest potential for increasing the 
quality and scope of future television systems and equipment is motion compensation. 
Motion compensation relies on the accurate determination of the velocities of objects in the 
image plane so that any movement can be accounted for in subsequent image processing 
operations. 

This Report describes experimental equipment designed and built at BBC Research 
Department for undertaking such velocity measurement; it forms one part of a complete 
real-time video motion estimation system and is based on the phase correlation method of 
motion vector extraction. The equipment described here implements the phase correlation
itself.

The design of the equipment is discussed and some results obtained from it are
presented.



Index terms: correlation; digital signal processing; equipment; motion; 
motion estimation; spatial frequency; video signals 
1. INTRODUCTION



Motion vector measurement is a topic of great 
interest in modern image processing. It has many 
different applications, including bandwidth reduction, 
slow motion smoothing, standards conversion, display 
rate up-conversion and film motion improvement. 
BBC Research Department has been working on a 
technique which splits motion estimation into two
parts. In the first stage, a global measurement is taken 
of the motion vectors in the scene, without attempting 
to determine which areas are moving with any given 
vector. In the second stage, particular vectors are 
assigned to each area of the scene using trial vector 
values derived in the first stage. This two-stage process 
results in a relatively short list of trial vectors as 
compared with other techniques, and a correspondingly 
reduced hardware complexity. 

The algorithm development was undertaken in 
software using the Department's Image Processing 
facilities. A good motion estimation method could 
therefore be produced before real-time hardware 
design began. This meant that the translation into 
circuitry could be undertaken with a high level of 
confidence in the end result. 

This Report will describe the hardware design 
used to accomplish the first stage of the estimation 
process (the vector measurement) using a phase
correlation algorithm. The design of the assignment
section of the system will be described in a separate 
Report. 



2. THE THEORY OF THE PHASE 
CORRELATION ALGORITHM 

Phase correlation is a process which can be 
described very simply in mathematical terms. In 
summary it comprises the following steps: 

1. Take the two-dimensional spatial Fourier 
transforms of the luminance components of 
two successive images. 

2. Form a cross power spectrum from the output 
of the two transforms and normalise it at each 
spatial frequency. 

3. Pass the unit length vectors resulting from 
step 2 to an inverse Fourier transform, to 
produce a 'phase correlation surface'. 



If one of the images is a cyclically shifted 
version of the other, the Fourier shift theorem tells us 
that the correlation surface is a delta function located 
at the displacement point. The coordinates of the
delta function indicate the velocity of one block
relative to the other. For example, an image cyclically
shifted by five pixels (i.e. five horizontal positions) and
phase correlated will produce a delta function at
coordinates (5,0) in the correlation surface. This relation
holds very well when the data represents part of a 
continuous scene and has only a small amount of 
overlap. Thus, when we have translational (but not 
cyclical) motion, distinct peaks are still produced in 
the correlation surface. By extension, when there is 
more than one motion in the image more than one 
peak is produced in the surface. For example, an 
object moving over a stationary background generates 
two peaks; one at coordinates coincident with the 
object's velocity, and the second at the velocity of the 
background (which in this example is zero, 
corresponding to a peak at the origin). To generate 
trial vector values for the assignment hardware, all 
that needs to be done is phase correlation followed by 
the location of the different peaks. 
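The three steps and the five-pixel example above can be sketched in a few lines of Python. This is an illustration only, not the method used in the equipment: it uses a naive discrete Fourier transform on a small 16 by 8 block of random data, and the function names (`dft`, `dft2`, `phase_correlate`, `find_peak`) and the block size are assumptions made for the example.

```python
import cmath
import random

def dft(seq, inverse=False):
    """Naive 1-D DFT, O(N^2); adequate for a small illustration."""
    n = len(seq)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [scale * sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                        for t, x in enumerate(seq))
            for k in range(n)]

def dft2(block, inverse=False):
    """2-D DFT: transform every row, then every column."""
    rows = [dft(row, inverse) for row in block]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def phase_correlate(prev, curr, eps=1e-9):
    """Transform both blocks, normalise the cross power spectrum, inverse transform."""
    f_prev, f_curr = dft2(prev), dft2(curr)
    unit = []
    for row_c, row_p in zip(f_curr, f_prev):
        out = []
        for c, p in zip(row_c, row_p):
            cross = c * p.conjugate()
            mag = abs(cross)
            # Zero-amplitude terms carry no phase information, so emit zero.
            out.append(cross / mag if mag > eps else 0j)
        unit.append(out)
    return [[v.real for v in row] for row in dft2(unit, inverse=True)]

def find_peak(surface):
    """Return (x, y, height) of the largest sample in the surface."""
    height, x, y = max((v, x, y) for y, row in enumerate(surface)
                       for x, v in enumerate(row))
    return x, y, height

WIDTH, HEIGHT, SHIFT = 16, 8, 5          # illustrative block size and shift
rng = random.Random(0)
previous = [[rng.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
# Cyclic horizontal shift of every line by SHIFT pixels.
current = [[row[(x - SHIFT) % WIDTH] for x in range(WIDTH)] for row in previous]

peak_x, peak_y, peak_height = find_peak(phase_correlate(previous, current))
print(peak_x, peak_y, round(peak_height, 6))
```

For a purely cyclic shift the surface is a single unit-height sample at (5, 0), as the shift theorem predicts; with real picture material the peak is lower and surrounded by noise.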

This simple mathematical description belies a 
fairly major computational task. At first glance approximately
seven hundred million operations per second are
needed to implement the transforms and cross spectrum.
(An operation is taken to mean a data multiplication
or accumulate action). This does not include the
generation of the complicated transform address and
coefficient sequences. Fortunately, digital signal processing
techniques can reduce the number of calculations
by a factor of two. Parallelism can also be employed 
to spread the computations over a number of separate 
units thus further reducing the processing power 
necessary within each individual calculating element. 

3. HARDWARE REALISATION 

The hardware constructed is a fairly direct 
translation of the algorithm described above. Hence 
there are three major processing blocks: 

1. Fast Fourier transform (FFT) the incoming 
video. 

2. Extract phase information and form difference 
array. 

3. Reverse FFT the array of phase difference 
vectors. 
The only significant departure from the scheme
described in the previous section is that the normalised 
cross power spectrum has been replaced by a phase 
extraction process. The end result is mathematically 
similar but has some hardware advantages (see 
Section 5.4). 

The two FFTs are essentially the same. The 
input to the forward transform will be a real signal 
(the incoming image luminance data) whilst its output 
will be complex. Conversely, the input to the reverse 
transform will be complex and its output real. It is 
possible to exploit these data properties to reduce the 
amount of hardware required by a factor of two; the 
FFT of two real signals may be performed 
simultaneously if they are scrambled together to form 
a complex input. The individual transforms may be
separated at the output using the symmetry inherent in 
the FFT (see Appendix 2). 
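The packing of two real signals into one complex transform can be illustrated as follows. This is a sketch under stated assumptions rather than a description of the hardware: a naive DFT stands in for the FFT, the eight-sample test lines are arbitrary, and the names `dft` and `two_real_dfts` are introduced for the example. The separation relies on the conjugate symmetry of the transform of a real signal.

```python
import cmath

def dft(seq):
    """Naive 1-D DFT used as a stand-in for the FFT."""
    n = len(seq)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(seq))
            for k in range(n)]

def two_real_dfts(a, b):
    """Transform two real sequences with a single complex DFT."""
    n = len(a)
    # Signal a goes to the real part, signal b to the imaginary part.
    z = dft([complex(x, y) for x, y in zip(a, b)])
    spec_a, spec_b = [], []
    for k in range(n):
        mirror = z[(n - k) % n].conjugate()
        spec_a.append((z[k] + mirror) / 2)    # conjugate-symmetric part
        spec_b.append((z[k] - mirror) / 2j)   # conjugate-antisymmetric part
    return spec_a, spec_b

line_a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # arbitrary test data
line_b = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
spec_a, spec_b = two_real_dfts(line_a, line_b)
```

The two recovered spectra agree with those obtained by transforming each line separately, at the cost of one complex transform instead of two.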

There are many different variants of the FFT 
and its implementation. However, the constraints on 
memory organisation, address generation, processing 
speed, accuracy, etc. mean that relatively few of them 
are feasible in real-time hardware. The algorithm 
selected for the equipment described in this Report 
was an in-place, decimation-in-time, radix-2 fast 
Fourier transform. The block size was programmable
provided it was a power of two and contained
no more than 8192 complex data points. Transforms 
of more than 512 points (in a single dimension) could 
be undertaken if the data were to be re-arranged from 
a large one-dimensional array into a two-dimensional 
array. This constrains the addresses and coefficients 
required. Two-dimensional transforms were carried out 
using a row/column decomposition technique, and one- 
dimensional operation could be selected if required. 

The hardware produced a field's worth of 
correlation surfaces in a picture period. Interlaced 
sources have a vertical and temporal offset between 
successive fields. This makes peak hunting much more 
difficult if the phase correlation is performed at field 
rate. Thus, all calculation was done between successive
field 1 input data planes, field 2 being ignored. Such a
simplification was extensively simulated by image 
processing prior to undertaking the hardware design, 
and was found to cause no significant degradation in 
performance. 

The transform block size must be chosen to 
match the parameters of the motion vectors it is 
wished to extract. The larger the block size, the higher 
the maximum velocity which may be measured. However,
only a few peaks may be accurately detected in
any given block due to noise. So, unless the intention
is to look for global motion such as camera panning, a
fairly large number of blocks are required to cover a
field. Computer simulations showed that a measurement
block size of 64 pixels by 32 field lines yielded a
good compromise and all work described here used 
these dimensions. The maximum possible velocity that 
can be estimated with this block size is ±31 pixels 
and ±15 field lines (±31 picture lines) per picture 
period. This is equivalent to somewhat less than one 
second per picture width/height. 
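The figures quoted above can be checked with a little arithmetic. The sketch below assumes that the largest unambiguous shift for a block of length N is N/2 - 1 (which reproduces the ±31 and ±15 quoted), and further assumes 720 active pixels per line and 288 active lines per field for a 625-line signal sampled at 13.5 MHz; neither active-picture figure is stated in this Report.

```python
PICTURE_PERIOD_S = 0.040         # 40 ms picture period, as quoted in the Report
ACTIVE_PIXELS_PER_LINE = 720     # assumption: 625-line signal sampled at 13.5 MHz
ACTIVE_LINES_PER_FIELD = 288     # assumption: 625-line, 2:1 interlaced signal

def max_shift(block_len):
    """Largest unambiguous displacement for a cyclic correlation of this length."""
    return block_len // 2 - 1

max_h = max_shift(64)            # pixels per picture period
max_v = max_shift(32)            # field lines per picture period

# Time for an object at maximum measurable speed to cross the active picture.
seconds_per_width = ACTIVE_PIXELS_PER_LINE / max_h * PICTURE_PERIOD_S
seconds_per_height = ACTIVE_LINES_PER_FIELD / max_v * PICTURE_PERIOD_S
print(max_h, max_v, round(seconds_per_width, 2), round(seconds_per_height, 2))
```

Under these assumptions both crossing times come out a little below one second, in line with the statement above.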



4. DATA PATH THROUGH THE SYSTEM 

A block diagram of the system is shown in 
Fig. 1. The incoming video data is fed into a field 
delay and then into an input buffer. The field delay is 
included so that half the processing can be carried out 
during the field 1 time slot and the remainder in the 
field 2 slot. The input buffer board windows the data 
to reduce spectral leakage as desired and then 
multiplexes the digital video to the real or imaginary 
data bus as dictated by the scrambling controls 
described below. 

The data is then routed to one of three 
identical 'butterfly' processor boards which perform 
the FFTs. The fundamental arithmetic element of an 
FFT is known as a butterfly after its schematic 
representation, shown in Fig. 2 where A and B are 
data inputs and Wn is a transform coefficient (A, B 
and Wn are complex). 
Fig. 1 - System block diagram.

Fig. 2 - A butterfly.
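As a minimal sketch of the butterfly arithmetic, assuming the decimation-in-time form in which the complex multiplication precedes the addition and subtraction (see Appendix 1), the operation on inputs A and B with coefficient W can be written as below. The function name and the numerical values are illustrative only.

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: multiply first, then add and subtract."""
    t = w * b
    return a + t, a - t

# Coefficient for index k of an N-point transform (here k = 1, N = 4).
w_4_1 = cmath.exp(-2j * cmath.pi * 1 / 4)

upper, lower = butterfly(1 + 2j, 3 - 1j, 1)   # W = 1 gives a two-point transform
rotated = butterfly(1 + 2j, 3 - 1j, w_4_1)
```

With W = 1 the outputs are simply the sum and difference of the inputs; every other case rotates B by the coefficient before the sum and difference are formed.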



Once the FFT is complete the spatial frequency 
data is sent back to the data bus and enters the phase 
extraction processing. Here, the FFT output is 
unscrambled (see Appendix 2) to separate the 
individual transforms which are subsequently converted 
to amplitude and phase format. Phase data from the 
previous frame is used to produce phase difference 
values which are in turn converted into unity length 
vectors via a look-up table. The amplitude data is only 
used indirectly (see Section 5.4). The vectors are 
rescrambled for transformation and passed to the 
reverse FFT butterfly processors. These are identical to 
the forward FFT processors but use a different set of 
coefficients. Once again there are three boards 
operating in parallel. When the reverse FFT is 
complete the correlation surface data is sent to a 
buffer which selects either the real or imaginary data 
bus as the output stream. The selected data moves on 
to the peak locating hardware which is housed in a 
separate rack. 



The peak hunting is carried out by a number 
of general-purpose high speed microprocessor boards 
working in parallel. The positions of the peaks they 
extract are interpolated to within a small fraction of a 
pixel/field line and so the extracted vectors have 
fractional pixel accuracy. This is very important in 
motion compensated image processing applications, 
such as HDTV bandwidth reduction, where integer 
pixel accuracy is not good enough. 
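The Report does not state which interpolation the peak hunting processors apply. One common choice, shown here purely as an assumed example, is to fit a parabola through the peak sample and its two neighbours and take the position of the vertex. The function names are hypothetical.

```python
def parabolic_offset(left, centre, right):
    """Offset of the vertex of the parabola through three equally spaced samples."""
    denom = left - 2.0 * centre + right
    if denom == 0.0:
        return 0.0                      # flat neighbourhood: keep the integer position
    return 0.5 * (left - right) / denom

def refine_peak(samples, index):
    """Interpolated position of a peak found at an interior integer index."""
    return index + parabolic_offset(samples[index - 1], samples[index], samples[index + 1])

# One-dimensional test profile: a parabola whose true maximum lies at x = 5.25.
profile = [1.0 - (x - 5.25) ** 2 for x in range(10)]
estimate = refine_peak(profile, 5)
print(estimate)
```

For a two-dimensional surface the same fit could be applied independently along the horizontal and vertical directions through the peak sample.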



5. DETAILS OF THE FFT 

5.1 FFT processors 

A block diagram of the butterfly processor 
boards, which undertake both the forward and reverse 
transforms, is shown in Fig. 3. 

The principle of operation is straightforward; 
however a large number of integrated circuits are 
required. This in turn necessitates a large multi-layer 
circuit board to accommodate all the components
(see Fig. 4).

The board has four separate banks of high 
speed random access memory (RAM). At any one 
time each of these banks will have a different role. 
One RAM will be an input buffer, taking data in from 
the outside world and storing it. A second bank will 
be an output buffer, holding data ready to be sent 
out to the external data busses. The third RAM 
will act as a source of data for the transform and will 
be providing data to the input of the butterfly 
arithmetic ICs. Finally, the fourth RAM will act as a 
destination for data emerging from the output of the 
butterfly. 

The RAM banks are switched around as two
pairs. When transform and I/O operations are
complete, the RAMs which were acting as the I/O
buffers become the transform source and destination
buffers and vice versa. The RAMs surrounding
the butterfly are also switched at the end of each
stage of the transform. Thus data which emerged
from the butterfly on one pass becomes the source
data for the next pass. This is known as ping-pong
operation.

Fig. 3 - Butterfly board block diagram.

Fig. 4 - Butterfly board.

Each RAM bank has its own external address 
bus whilst input and output data is multiplexed onto a 
single complex data bus. Each of the butterfly boards 
is allocated a unique address and only responds to 
I/O instructions when the correct code is present. 
Hence a number of them may be multiplexed onto a 
single data bus. The butterfly calculations are 
performed by a complex multiplier IC in conjunction 
with two complex accumulator ICs. All calculations 
are done using fixed point arithmetic. The input 
video is quantised to eight bit accuracy, but sixteen 
bit processing is used throughout the FFTs. This 
provides sufficient dynamic range for phase 
correlation. The magnitude of the data input to the 
butterfly boards is controlled by the input buffer to
ensure that overflows cannot occur during the first 
stage of the transform. Thereafter, the scaling ensures 
that intermediate results do not overflow. The control 
lines which determine the amount of scaling at each 
stage of the transform are available at the edge 
connector so a more sophisticated regime could be 
instituted at a later date should there be any advantage 
in so doing. 

Each butterfly processor has externally 
addressed non-volatile memory containing all the 
transform coefficients for any power-of-two transform 
up to 1024 points. 



5.2 Address generation 

Each of the RAM banks on the butterfly 
processors requires a different address sequence 
according to its function. That is, whether it is being 
used for I/O or transform calculation. The most 
obvious way to achieve this is to have four dedicated 
sequence generators (input, output, transform source 
and transform destination) and to switch the sequences 
around to the different RAM banks as their function 
changes. The switching control reduces to a simple 
state machine model; the next switch state can always 
be deduced from the current state of the switches and 
other control inputs. For example, if RAM bank 1 is 
currently the input buffer, when the I/O and 
transform operations are complete it will become the 
transform source RAM as it will contain fresh video 
data. 
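The switching of bank roles can be modelled as a small state machine. The sketch below is an interpretation, not a description of the actual control logic: only the transition "input buffer becomes transform source" is given in the text, and the remaining transitions (in particular that the bank holding the final transform result becomes the output buffer) are assumptions. The function names are hypothetical.

```python
def end_of_stage(roles):
    """Ping-pong: swap the transform source and destination banks between stages."""
    swap = {"source": "destination", "destination": "source"}
    return {bank: swap.get(role, role) for bank, role in roles.items()}

def end_of_block(roles):
    """Exchange the I/O pair with the transform pair once both are complete."""
    swap = {"input": "source",          # fresh video data is transformed next
            "output": "destination",    # emptied buffer receives butterfly results
            "source": "input",          # assumed: spent source bank takes new input
            "destination": "output"}    # assumed: final results are sent out
    return {bank: swap[role] for bank, role in roles.items()}

roles = {1: "input", 2: "output", 3: "source", 4: "destination"}
after_stage = end_of_stage(roles)
after_block = end_of_block(roles)
print(after_stage)
print(after_block)
```

Because each new assignment depends only on the current assignment and on which event has occurred, the controller needs no memory beyond the present switch state.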

To simplify things it is possible to reduce the 
number of sequence generators to two; the use of an 
in-place algorithm means that the transform data is 
returned to the address it was taken from. Thus, the 
transform destination address is simply the source 
address delayed by the latency of the butterfly 
processor. Also, a single counter circuit, locked to the 
incoming synchronisation signal, can be used to derive 
both the input and output address sequences. Strictly
speaking, it is not necessary to have both the input
and output sequences programmable; one of them
can be a simple ramp. However, with control of both,
the I/O can be mapped into any desired order. Thus
it is possible to arrange for the frequency domain data
to have the 'DC' term in the centre with higher spatial
frequencies at increasing distances from the origin.
This makes the output more easily understood when
viewed directly, for example on a monitor.
In order to complete a 64 by 32 point FFT,
over 22500 distinct data addresses are needed. 
However, the amount of memory needed to store the 
transform addresses can be reduced to about 550 
locations if the row or column number of the data 
being processed is appropriately combined with the 
stored transform address. For example, when the data 
is being processed in rows, each stage of the actual 
transform is exactly the same for every row of data. 
The only difference between the addresses required for 
each of the rows is that the data is shifted by an 
amount corresponding to the length of a row. 
Effectively, the location of the first pixel in each row 
of data needs to be added to the base transform 
address. Therefore, the row number can be used as the 
most significant part of the address whilst the stored 
base transform address represents the least significant 
part. When in column mode an analogous situation 
holds. In this case the column number is used as the 
least significant part of the word and the 
stored address forms the most significant section. 
This arrangement means that the stored column 
addresses are in exactly the same form as the stored 
row addresses, which is convenient when generating 
them. 
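The address composition described above, and one way of accounting for the figures quoted, can be sketched as follows. It is assumed here that the block is stored row by row, that one address is needed per data point per transform stage, and that one base sequence is stored per stage; the function names are hypothetical.

```python
ROW_LEN, NUM_ROWS = 64, 32                # block size quoted in the Report
COLUMN_BITS = ROW_LEN.bit_length() - 1    # 6 bits select a pixel within a row
ROW_BITS = NUM_ROWS.bit_length() - 1      # 5 bits select a row

def row_mode_address(row, base):
    """Row transforms: row number most significant, stored base address least."""
    return (row << COLUMN_BITS) | base

def column_mode_address(column, base):
    """Column transforms: stored base address most significant, column number least."""
    return (base << COLUMN_BITS) | column

# Radix-2 stages: log2(64) for the row transforms, log2(32) for the columns.
row_stages, column_stages = COLUMN_BITS, ROW_BITS
total_addresses = (row_stages + column_stages) * ROW_LEN * NUM_ROWS
stored_addresses = row_stages * ROW_LEN + column_stages * NUM_ROWS
print(total_addresses, stored_addresses)
```

On these assumptions the totals are 22528 addresses for the full transform and 544 stored base addresses, consistent with the figures of "over 22500" and "about 550" given above.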

5.3 Computer control 

The address generation circuitry can be 
controlled from a personal computer, and an Atari ST 
was used for all of the development work. This 
allows the operator to change the size of the 
transforms, the number of stages, the transform 
coefficients and addresses from the computer. This 
facility was found to be invaluable for system 
commissioning; with a recursive algorithm such as the 
FFT it is extremely difficult to isolate any problems 
without total control of the processing. Also, the 
control waveforms (the 'housekeeping' signals) to all 
boards can be loaded from the computer. This is a 
useful facility as it allows the system to be 
reconfigured by downloading new housekeeping files. 
Once de-bugging was complete the various computer 
files were transferred to non-volatile memory devices 
so that the hardware was ready to work at power-on 
without a boot-up procedure. 

5.4 Phase extraction 

The amount of hardware required for 
calculation of the normalised cross power spectrum 
can be reduced by replacing it with phase differencing. 
In mathematical terms it can be shown that these 
operations are similar, as follows: 

If the two discrete Fourier series are $F_1(m,n)$
and $F_2(m,n)$ then the normalised cross power
spectrum $P(m,n)$ is formed as:

\[ P(m,n) = \frac{F_1(m,n)\,F_2^*(m,n)}{\left|F_1(m,n)\,F_2^*(m,n)\right|} \]

where the asterisk indicates conjugation.

The amplitude of $P(m,n)$ is therefore given by:

\[ |P(m,n)| = \frac{|F_1(m,n)|\,|F_2^*(m,n)|}{\left|F_1(m,n)\,F_2^*(m,n)\right|} = 1 \]

and the phase by:

\[ \angle P(m,n) = \angle F_1(m,n) + \angle F_2^*(m,n) \]

so

\[ \angle P(m,n) = \angle F_1(m,n) - \angle F_2(m,n) \]

Thus it is possible to implement the normalised 
cross power spectrum by finding the phase of each 
spatial frequency term and calculating the difference 
value. Problems can occur when there are zero 
amplitude components. However, in this instance the 
phase angles have no real meaning so these phase 
differences can be trapped by setting the output vector 
to zero — rather than unity — amplitude. 
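The equivalence of the two formulations, including the trapping of zero-amplitude terms, can be checked numerically. The sketch below is illustrative: the function names, the threshold `eps` and the test values are assumptions.

```python
import cmath

def normalised_cross_power(f1, f2, eps=1e-12):
    """Unit vector F1.F2* / |F1.F2*|, or zero when the product has no amplitude."""
    cross = f1 * f2.conjugate()
    mag = abs(cross)
    return cross / mag if mag > eps else 0j

def phase_difference_vector(f1, f2, eps=1e-12):
    """Unit vector at angle (phase of F1 - phase of F2), or zero if either term vanishes."""
    if abs(f1) <= eps or abs(f2) <= eps:
        return 0j
    return cmath.rect(1.0, cmath.phase(f1) - cmath.phase(f2))

pairs = [(3 + 4j, 1 - 2j), (-0.5 + 0.1j, -2 - 7j), (5j, -3 + 0j), (0j, 1 + 1j)]
differences = [abs(normalised_cross_power(a, b) - phase_difference_vector(a, b))
               for a, b in pairs]
print(max(differences))
```

The phase difference route requires only a subtraction and a look-up of the unit vector, which is the hardware saving referred to above.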

In the hardware, the phase and amplitude of 
the current spatial frequency terms are calculated and 
the phase sent to a frame store. The output from that 
frame store will then be the phase extracted from the 
previous frame of spatial frequency components. Thus 
the old phase value can be subtracted from the new 
phase value to effect the normalised cross correlation. 
This technique removes the need for a complex data 
frame store and a further complex number multiplier. 
Although it involves two potentially noisy inverse 
tangent estimates any problems may be ameliorated by 
exploiting the properties of phase correlation; firstly 
the amplitude of the vectors generated from the phase 
difference values can be weighted according to some 
strategy designed to improve the signal to noise ratio. 
For example, it is possible to down-weight vectors 
produced by low amplitude frequency domain terms 
(including zero amplitude as mentioned above). 
Alternatively, other schemes could be used, such as 
making the weighting dependent on the noise level of 
the input video or the number of trial vectors available 
in the assignment hardware. Secondly, a reduced 
number of bits of phase information may be used in 
the differencing calculation. 

6. RESULTS 

There are two main ways to display the 
correlation surfaces for performance analysis. The first 
is to capture them using the peak hunting 
microprocessors, transfer them to a general purpose 
computer (a DEC MicroVAX in the case of the
results presented here) and convert the data into a
contour plot. The second method is to pass the
correlation surfaces through a digital to analogue
converter and display them on a monitor where the
brightness of the display will represent the height of
the surface.

The results outlined below were obtained from 
a camera pointed at a circular drum covered with a 
painted scene. The drum could be rotated over a wide
range of speeds using a remote control device. It was 
thus possible to simulate camera panning with a speed 
set by the rate of drum rotation. This provided a 
source of moving pictures under controlled repeatable 
conditions. The camera luminance output was digitised 
and used as the video input to the correlation 
hardware. 

Fig. 5 shows a captured surface for a still scene 
and Fig. 6 shows the corresponding surface when the 
scene was panning horizontally at 8 pixels per 40 ms 
(a picture period). The surface peak is sharp and 
unmistakable in both cases and the peak hunting 
hardware had no problem correctly locating it. 



Note that the diagrams only represent the 
central portion of the correlation surfaces and do not 
extend to the edge of the measurement blocks. This is 
because the peak hunting microprocessors restrict themselves
to capturing the part of the surface which the
assignment modules, employed in the second stage of 
the motion measurement process, can use. The 
examples shown here have a range of 16 lines by 50 
pixels, set by the storage limit provided on the motion- 
compensated interpolators. This limit can be extended 
through the use of larger memory devices on the 
interpolators and it is planned to do this in the future. 

Examination of the error signals produced by 
the assignment system showed that the phase 
correlation and peak hunting hardware combine to 
produce vectors that are accurate to around one eighth
of a pixel per frame period for panning scenes such as
the spinning drum.

To generate an informative display using a
monitor as the output device, a composite signal was
constructed from three elements. These were the incoming
video signal, the correlation surfaces themselves
and a reference grid. The output surfaces were aligned
with the input video to which they related.

Fig. 5 - Correlation surface for stationary source.

Fig. 6 - Correlation surface for source moving at
8 pixels/40 ms.

The reference grid is a useful aid in the 
analysis of the correlation surfaces. It is effectively a 
pair of axes representing horizontal and vertical 
velocity, the point of intersection between them being 
zero velocity. Thus, this point is at the centre of the 
measurement block to which it relates. (See Fig. 7). 

These grid lines are necessary to avoid the 
correlation surface information being confused by the 
underlying video source. For example, with camera pan- 
ning it is very difficult to see the surface peak move- 
ment without the grid being present, because the eye is 
distracted by the moving video. With the grid in place 
the eye is given a fixed reference point to anchor itself. 

Fig. 8 and Fig. 9 illustrate a small part of the 
monitor screen output corresponding to the surfaces 
shown in Fig. 5 and Fig. 6. 

It can be seen that the peaks have moved 
horizontally away from the zero velocity point for the 
moving source. As the speed of panning is increased 
so the peaks move further from the block centre and 
their height decreases. 

Fig. 10 also shows the surfaces for a speed of
8 pixels per picture period but the video has been
'grabbed' into a frame store to freeze the movement. 
Note that positive velocity is defined as motion down 
and to the right. Thus when an object moves to the 
right the peak moves to the right but when an object 
moves down the peak moves up. 

Fig. 11 illustrates a correlation surface for a 
diagonally moving scene. Here, of course, the surface 
peak moves diagonally away from the grid origin. 
Fig. 12 shows the output when the pan speed
has been increased to 25 pixels per picture period. The 
peak height has been reduced and its profile 
broadened. 

However, when a camera shutter of 1/250th
of a second is used, the output is modified to that 
shown in Fig. 13 and its grabbed version Fig. 14. 
Peak height is somewhat restored and the width 
reduced. This shows that camera integration plays a 
large part in the reduction of peak height by 
diminishing the amount of detail in the scene. 

The system works surprisingly well even with 
large amounts of uncorrelated information entering the 
measurement block. Indeed, if a fast shutter is 
introduced, velocities approaching the theoretical 
maximum can be tracked. It should be noted however 
that, even in areas containing no motion, the height of 
the peak is related to the image content to a certain 
extent. In areas without any detail, the peak height is 
reduced relative to those in more active parts of the 
scene. 

In Fig. 15 a stationary object has been placed 
in front of the moving background. In the resulting 
correlation surface, two peaks are visible; one at zero 
velocity corresponding to the fixed object and the 
other at the pan speed of the moving background. For 
clarity, the same results are presented in Fig. 16 
without the source video or grid. The double peaks 
can be clearly discerned. Note that their relative 
heights are partly determined by the proportion of the 
measurement block that is moving/stationary. 

The contour plot for a single block obtained 
under the same conditions as Figs. 15 and 16 is shown 

in Fig. 17. 



(PH-301) 



Fig. 8 - Correlation surface for stationary source. 



Fig. 9 - Correlation surface for source moving at 
8 pixels/40 ms. 



Fig. 10 - 'Grabbed' version of Fig. 9. 



Fig. 11 - Correlation surface for diagonally moving scene. 



Fig. 12 - Correlation surface for scene moving at
25 pixels/40 ms.



Fig. 13 - Correlation surface for scene moving at 
25 pixels/40 ms (shutter speed 1/250th of a second).

Fig. 14 - 'Grabbed' version of Fig. 13.



Fig. 15 - Correlation surface for stationary and moving 
objects. 



Fig. 16 - Correlation of Fig. 15 without the source video. 




Fig. 17 - Contour plot of correlation surface containing 
double peaks. 
7. DISCUSSION

The results of the previous section demonstrate 
that the hardware works well on a wide range of 
source material and object velocities. There are 
however some areas in which its performance could 
be further improved. Probably the most important 
addition would be the use of an overlap between 
adjacent blocks of data used to form the different 
correlation surfaces. This would allow the implementa- 
tion of a high performance window without worrying 
about vignetting, i.e. moving objects are not lost when 
they enter areas of low window gain. The window 
would improve the performance on source material 
which produces spectral leakage. Spectral leakage 
occurs when the signal being transformed contains 
significant information at spatial frequencies which do 
not have an integer number of cycles in the transform 
period. In this situation information from one trans- 
form frequency bin will 'leak' into another. Application 
of a window modifies the shape of the bins so as to 
reduce the response at frequencies which would other- 
wise cause leakage. It is straightforward to give the 
system overlap in the horizontal direction. This can be 
done by stretching the active line and using repeated 
pixels in the block overlap regions. However, to obtain 
vertical overlap some active picture lines would have 
to be dropped due to lack of processing time. 
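The effect of a window on spectral leakage can be demonstrated with a one-dimensional example. The Report does not say which window function is intended; the Hann window used below, the 32-point transform length and the test frequency of 4.5 cycles per transform period are all assumptions chosen for illustration.

```python
import cmath
import math

N = 32                                     # illustrative transform length

def dft_magnitudes(samples):
    """Magnitude of each bin of a naive DFT."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n)]

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n) for t in range(n)]

# 4.5 cycles per transform period: not an integer, so the energy leaks.
signal = [math.cos(2.0 * math.pi * 4.5 * t / N) for t in range(N)]
plain = dft_magnitudes(signal)
windowed = dft_magnitudes([s * w for s, w in zip(signal, hann(N))])

far_bin = 12                               # a bin well away from the 4.5-cycle component
print(plain[far_bin], windowed[far_bin])
```

The response in the distant bin falls substantially once the window is applied. The price is reduced gain towards the block edges, which is why overlap between adjacent blocks is needed if moving objects near the edges are not to be lost.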

In the period since hardware design and con- 
struction began, a new generation of integrated circuits 
has been produced. If the system were to be redesigned 
using the newer chip-sets the design complexity would 
be considerably reduced. The later FFT-specific 
semiconductor devices will generate all the addresses, 
coefficients and switching waveforms necessary to 
implement the transform. This would give a size 
reduction by a factor of at least three to four (perhaps 
much more) and yet may give improved performance 
through the use of more sophisticated scaling tech- 
niques. The same order of size reduction can also be 
applied to the other sections of the motion estimator 
(the peak extraction and vector assignment — not 
described in this Report) so the final system could be 
reduced in size to a single rack or less. If applications- 
specific integrated circuits designed for the motion 
estimator are incorporated there is scope for even more 
size reduction, thus further enhancing the range of 
applications in which motion compensation is feasible. 

The equipment operates at a sample rate of 
13.5 MHz. However, computer simulations have 
shown that it may be used with down-filtered High
Definition Television (HDTV) signals. Since vectors 
are calculated to sub-pixel resolution, the accuracy is 
not compromised and the hardware is considerably 
simplified. The down-filter need only be of simple
design, that is, a few taps within a single HDTV
field; indeed if more complex down-filters, which have
contributions from more than one field, are used more
than one correlation peak can be produced. The
vector assignment may also be done with down- 
filtered signals. This means that the hardware could be 
used in an HDTV bandwidth compression system 
requiring motion vectors without the need to extend 
its size or clock frequency. 



8. CONCLUSIONS 

This Report has described hardware for 
performing real-time phase correlation. The equipment 
has been built and performs well on a wide range of 
input material and object velocities and verifies
computer simulations of this algorithm. The hardware
can be used in almost any application requiring 
motion vector measurement and will thus be a 
valuable tool in the development of such systems. 
APPENDIX 1
The Fast Fourier Transform Algorithm 

The FFT algorithm implemented in the phase correlation hardware was a decimation-in-time, radix-2, 
in-place, two-dimensional transform using row/column decomposition. These terms are explained below: 

Decimation-in-time (DIT) refers to the fact that the input data sequence is successively split into smaller
sections for processing. In hardware terms the only major difference between the DIT version and its alternative
(decimation-in-frequency, DIF) is that the complex multiply comes before the addition/subtraction. Hence, the
fundamental butterfly shape has been determined. 

Radix-2 refers to the number of data points operated on in each butterfly, in this instance two. Higher 
radices can be more efficient in terms of the number of butterflies to be performed and hence clock cycles required 
in total. However, in this application, they would require additional hardware which was not desirable. 

In-place means that data is returned to the location it was taken from after passing through the butterfly 
arithmetic. This can be a useful source of simplification when generating transform address sequences. 

These points are illustrated in Fig. A1. This is a signal flowgraph for an 8-point transform. It can be 
seen that, at each stage of the transform, data returns to the place it was taken from. It can also be seen that the 
data does not emerge in the same order as it was entered. The output indices are bit reversed. That is, if the data 
indices are written in binary, the bit ordering of the output indices is the reverse of those at the input. This means 
that some re-ordering of the data may be necessary at the output stage. 
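
As a rough illustration of the terms explained above, the following Python sketch models an in-place, radix-2, decimation-in-time transform that takes its input in natural order and leaves its output in bit-reversed order. It is a software model only and is not taken from the hardware design: the function names are hypothetical, and the ordering of the twiddle factors (bit-reversed within each stage) is an assumption, being one common way of arranging a flowgraph of this kind.

```python
import cmath


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_dit_in_place(data):
    """Radix-2 DIT FFT; overwrites `data`, leaving results bit-reversed."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or n != 1 << bits:
        raise ValueError("length must be a power of two")
    half = n // 2   # distance between the two inputs of a butterfly
    groups = 1      # number of butterfly groups in the current stage
    while half >= 1:
        for g in range(groups):
            # Assumed ordering: twiddle exponents in bit-reversed order.
            w = cmath.exp(-2j * cmath.pi * bit_reverse(g, bits - 1) / n)
            base = 2 * half * g
            for i in range(base, base + half):
                a = data[i]
                b = data[i + half] * w   # complex multiply comes first (DIT)
                data[i] = a + b          # both results are written back to
                data[i + half] = a - b   # the locations they came from
        half //= 2
        groups *= 2
    return data


def dft(data):
    """Direct DFT, used only as a reference for checking."""
    n = len(data)
    return [sum(data[x] * cmath.exp(-2j * cmath.pi * k * x / n)
                for x in range(n)) for k in range(n)]


samples = [complex(v) for v in (1, 2, 3, 4, 5, 6, 7, 8)]
reference = dft(samples)
result = fft_dit_in_place(list(samples))
# Location q holds the coefficient whose index is q with its bits reversed.
natural_order = [result[bit_reverse(k, 3)] for k in range(8)]
```

Reading the result back through `bit_reverse` is the software equivalent of the output re-ordering mentioned above; for eight points the locations are visited in the order 0, 4, 2, 6, 1, 5, 3, 7.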




Fig. A1 - Signal flowgraph for an 8-point transform 

Row/column decomposition means that the two-dimensional transform is implemented by first taking 
one-dimensional transforms along all the rows of data and then transforming similarly down all the columns. Note 
that at the end of each stage of the transform it is necessary to switch the roles of the source and destination 
RAMs. An overhead equal to at least the butterfly arithmetic latency is incurred at each switch. Thus, to minimise 
lost processing time one should minimise the number of swaps. This is done by completing each stage of the 
transform for the entire set of rows and columns before a switch rather than completing each individual transform 
before moving on to the next row/column. The total number of switches then becomes equal to the sum of the 
number of stages in the two transforms plus the number of stages in the column transforms minus one. 

In the hardware actually constructed, further manipulation of the address sequencing was employed. Bit 
reversal compensation was carried out but the data was also mapped into a more natural order. Element (0,0), 
the 'DC' term, was placed at the centre of the output block with positive frequencies in ascending order to the 
right and negative frequencies in order to the left. This was useful both for commissioning the equipment and for 
right and negative frequencies in order to the left. This was useful both for commissioning the equipment and for 
performance analysis. The mapping allows the correlation surface peaks to move in a more easily understood 
fashion when viewed over a moving video source. Also, the peak hunting hardware does not have to re-order the 
data before it can begin its peak search pattern. This is useful if the shuffling would otherwise have to be done in 
software with an attendant time penalty. The address sequences must also allow for scrambling/unscrambling of 
the data; the amount of hardware required can be reduced by a factor of two if one takes into account the fact 
that the incoming and outgoing signals are real — see Appendix 2. 
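
The centring of the 'DC' term can be pictured as a rotation of each axis by half its length. The Python sketch below is an assumption-laden illustration: the Report describes only the horizontal ordering explicitly, so applying the same rotation vertically is a guess, the function names are hypothetical, and the data is moved in software here whereas the hardware achieved the mapping through its address sequencing.

```python
def centre_order(values):
    """Rotate a sequence so that element 0 lands at index len(values) // 2."""
    n = len(values)
    centred = [None] * n
    for k, value in enumerate(values):
        centred[(k + n // 2) % n] = value   # destination address for source k
    return centred


def centre_block(block):
    """Apply the same rotation along both axes of a 2-D block (assumed)."""
    rows_centred = [centre_order(row) for row in block]
    return centre_order(rows_centred)


# Frequency labels of an 8-point transform in natural DFT order.
natural = [0, 1, 2, 3, -4, -3, -2, -1]
centred = centre_order(natural)
```

After the rotation the labels run from -4 on the left through 0 at the centre to 3 on the right, so a peak that moves steadily in one direction no longer wraps around the edge of the block as it passes through zero.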



APPENDIX 2 
Simultaneous Transform of Two Real Functions 

Consider two real time-series h(x,y) and g(x,y), where x and y are integers such that 

$$0 \le x < M$$

$$0 \le y < N$$

and define 

$$\mathrm{DFT}(h(x,y)) = H(m,n)$$

$$\mathrm{DFT}(g(x,y)) = G(m,n)$$

where DFT indicates the discrete Fourier transform and 

$$0 \le m < M$$

$$0 \le n < N$$

If a new complex series u(x,y) is formed so that 

$$u(x,y) = h(x,y) + j\,g(x,y)$$

then, again, we can define 

$$\mathrm{DFT}(u(x,y)) = U(m,n)$$

From the linearity property 

$$U(m,n) = H(m,n) + j\,G(m,n)$$

$$U(m,n) = (H_R(m,n) + j\,H_I(m,n)) + j\,(G_R(m,n) + j\,G_I(m,n))$$

where the subscript R implies the real part and the subscript I implies the imaginary part of the associated signal. 

So 

$$U(m,n) = (H_R(m,n) - G_I(m,n)) + j\,(H_I(m,n) + G_R(m,n))$$

and therefore 

$$U_R(m,n) = H_R(m,n) - G_I(m,n) \tag{1}$$

$$U_I(m,n) = H_I(m,n) + G_R(m,n) \tag{2}$$

For a real input signal the DFT is Hermitian so that 

$$H_R(m,n) = H_R(M-m,N-n)$$

$$H_I(m,n) = -H_I(M-m,N-n)$$

$$G_R(m,n) = G_R(M-m,N-n)$$

$$G_I(m,n) = -G_I(M-m,N-n)$$

From (1) and (2) 

$$U_R(M-m,N-n) = H_R(M-m,N-n) - G_I(M-m,N-n)$$

$$U_I(M-m,N-n) = H_I(M-m,N-n) + G_R(M-m,N-n)$$

Thus 

$$U_R(M-m,N-n) = H_R(m,n) + G_I(m,n) \tag{3}$$

$$U_I(M-m,N-n) = -H_I(m,n) + G_R(m,n) \tag{4}$$

Combining (1), (2), (3) and (4) 

$$H_R(m,n) = 0.5\,(U_R(m,n) + U_R(M-m,N-n)) \tag{5}$$

$$H_I(m,n) = 0.5\,(U_I(m,n) - U_I(M-m,N-n)) \tag{6}$$

$$G_R(m,n) = 0.5\,(U_I(m,n) + U_I(M-m,N-n)) \tag{7}$$

$$G_I(m,n) = 0.5\,(-U_R(m,n) + U_R(M-m,N-n)) \tag{8}$$

This is implemented in the phase correlation hardware as follows. The complex series U is formed by sending the 
incoming video data to the real and imaginary busses as appropriate; i.e. one complete block is input on the real 
bus and one complete block is input on the imaginary bus. Thus one block corresponds to the sequence h(x,y) and 
the other to the sequence g(x,y). The complex data is transformed using an FFT and the spatial frequency terms 
are output in pairs for the sequences U(m,n) and U(M-m, N-n). On the phase extraction board, the addition and 
subtraction necessary to separate the two sequences H(m,n) and G(m,n) is carried out in the manner shown in 
equations 5 to 8. 
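
The separation in equations 5 to 8 can be checked numerically with the following Python sketch. It is a software illustration, not the hardware arrangement: a direct DFT stands in for the FFT, the block values are invented for the example, the function names are hypothetical, and the indices M-m and N-n are taken modulo M and N (so that M-m is read as 0 when m = 0), which follows from the periodicity of the DFT rather than from anything stated explicitly in the Report.

```python
import cmath


def dft_2d(block):
    """Direct 2-D DFT of block[x][y], 0 <= x < M, 0 <= y < N."""
    M, N = len(block), len(block[0])
    return [[sum(block[x][y]
                 * cmath.exp(-2j * cmath.pi * (m * x / M + n * y / N))
                 for x in range(M) for y in range(N))
             for n in range(N)] for m in range(M)]


def separate(U):
    """Recover H(m,n) and G(m,n) from U(m,n) using equations 5 to 8."""
    M, N = len(U), len(U[0])
    H = [[0j] * N for _ in range(M)]
    G = [[0j] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            a = U[m][n]
            b = U[(M - m) % M][(N - n) % N]   # U(M-m, N-n), indices wrapped
            H[m][n] = complex(0.5 * (a.real + b.real),    # equation 5
                              0.5 * (a.imag - b.imag))    # equation 6
            G[m][n] = complex(0.5 * (a.imag + b.imag),    # equation 7
                              0.5 * (-a.real + b.real))   # equation 8
    return H, G


h = [[1, 2, 3, 4], [5, 6, 7, 8]]      # illustrative real blocks
g = [[0, -1, 2, 9], [4, 4, -3, 1]]
u = [[complex(hv, gv) for hv, gv in zip(h_row, g_row)]
     for h_row, g_row in zip(h, g)]
H, G = separate(dft_2d(u))
H_ref, G_ref = dft_2d(h), dft_2d(g)
max_error = max(abs(H[m][n] - H_ref[m][n]) + abs(G[m][n] - G_ref[m][n])
                for m in range(2) for n in range(4))
```

One complex transform followed by a few additions and subtractions therefore replaces two separate transforms, which is the source of the factor-of-two hardware saving.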

An analogous but simpler situation holds for the reverse Fourier transform. In this case we know that the complex 
input signal (the array of phase difference vectors) transforms to a real correlation surface. Thus if the complex 
vector inputs are V(m,n) and W(m,n) a new complex series Z(m,n) is formed as 

$$Z(m,n) = V(m,n) + j\,W(m,n)$$

At the output of the reverse FFT z(x,y) is obtained where 

$$z(x,y) = v(x,y) + j\,w(x,y)$$

But v(x,y) and w(x,y) are real signals. Therefore separation of the two is trivially achieved by looking in the real 
and imaginary stores as appropriate. 
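
The reverse case can be illustrated in the same way. The Python sketch below is reduced to one dimension for brevity, which is a simplification made here and not the two-dimensional arrangement of the hardware; the sample values and function names are likewise invented for the example.

```python
import cmath


def dft(values, sign=-1):
    """Direct DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(values)
    return [sum(values[x] * cmath.exp(sign * 2j * cmath.pi * k * x / n)
                for x in range(n)) for k in range(n)]


def inverse_pair(V, W):
    """Inverse-transform two Hermitian spectra with one complex inverse DFT."""
    n = len(V)
    Z = [v + 1j * w for v, w in zip(V, W)]          # Z = V + jW
    z = [value / n for value in dft(Z, sign=+1)]    # z = v + jw
    return [c.real for c in z], [c.imag for c in z]


v = [1.0, 4.0, -2.0, 0.5, 3.0, 3.0, -1.0, 2.0]   # illustrative real sequences
w = [0.0, 1.0, 0.0, -1.0, 2.0, 2.0, 5.0, -4.0]
v_out, w_out = inverse_pair(dft(v), dft(w))
max_error = max(abs(a - b) for a, b in zip(v_out + w_out, v + w))
```

No additions or subtractions are needed after the transform: the real part of the output is one sequence and the imaginary part is the other.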